A Bayesian encourages dropout
Dropout is one of the key techniques for preventing learning from
overfitting. It has been explained as a kind of modified L2 regularization.
Here, we shed light on dropout from a Bayesian standpoint. The Bayesian
interpretation enables us to optimize the dropout rate, which is beneficial
both for learning the weight parameters and for prediction after learning.
The experimental results also encourage optimization of the dropout rate.
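Under this Bayesian reading, each Bernoulli dropout mask can be treated as a sample from an approximate posterior, so test-time prediction averages the output over many sampled masks. A minimal NumPy sketch with a hypothetical fixed linear "network" (illustration only, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    # Bernoulli mask: each unit is dropped with probability `rate`;
    # survivors are rescaled so the expected activation is unchanged.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def mc_predict(forward, x, rate, n_samples, rng):
    # Treating each dropout mask as a posterior sample, prediction
    # averages the network output over many sampled masks.
    return np.mean([forward(dropout(x, rate, rng)) for _ in range(n_samples)],
                   axis=0)

# Toy "network": a fixed linear map (hypothetical, for illustration).
w = np.array([0.5, -1.0, 2.0])
forward = lambda h: float(h @ w)

x = np.ones(3)
pred = mc_predict(forward, x, rate=0.2, n_samples=10000, rng=rng)
print(abs(pred - forward(x)) < 0.1)  # MC average is close to the clean output
```

Because the rescaled mask has unit mean, the Monte Carlo average converges to the deterministic output; in a real network the spread of the samples also gives a cheap uncertainty estimate.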
Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
We propose a new regularization method based on virtual adversarial loss: a
new measure of local smoothness of the conditional label distribution given
input. Virtual adversarial loss is defined as the robustness of the conditional
label distribution around each input data point against local perturbation.
Unlike adversarial training, our method defines the adversarial direction
without label information and is hence applicable to semi-supervised learning.
Because the directions in which we smooth the model are only "virtually"
adversarial, we call our method virtual adversarial training (VAT). The
computational cost of VAT is relatively low. For neural networks, the
approximated gradient of virtual adversarial loss can be computed with no more
than two pairs of forward- and back-propagations. In our experiments, we
applied VAT to supervised and semi-supervised learning tasks on multiple
benchmark datasets. With a simple enhancement of the algorithm based on the
entropy minimization principle, our VAT achieves state-of-the-art performance
for semi-supervised learning tasks on SVHN and CIFAR-10.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence
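The core computation described above, finding the perturbation direction that most changes the output distribution without using labels, can be sketched with a power-iteration step. The toy two-class classifier and its weight matrix below are hypothetical, and finite differences stand in for the back-propagation pass of the actual method:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def vat_perturbation(predict, x, xi=1e-2, eps=1.0, n_power=1, seed=0):
    # One step of power iteration approximates the direction d that most
    # increases KL(p(.|x) || p(.|x+d)); note that no label is needed.
    rng = np.random.default_rng(seed)
    p = predict(x)
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)
    for _ in range(n_power):
        # Finite differences stand in for the backprop pass of the paper.
        g = np.zeros_like(x)
        for i in range(x.size):
            e_i = np.zeros_like(x)
            e_i[i] = 1.0
            g[i] = (kl(p, predict(x + xi * d + xi * e_i))
                    - kl(p, predict(x + xi * d - xi * e_i))) / (2 * xi)
        d = g / (np.linalg.norm(g) + 1e-12)
    return eps * d  # adversarial direction scaled to the radius eps

# Toy two-class classifier with a fixed (hypothetical) weight matrix.
W = np.array([[2.0, 0.0], [0.0, 1.0]])
predict = lambda x: softmax(W @ x)

x = np.array([0.1, -0.2])
r_adv = vat_perturbation(predict, x)
vat_loss = kl(predict(x), predict(x + r_adv))
print(vat_loss > 0.01, abs(float(np.linalg.norm(r_adv)) - 1.0) < 1e-6)
```

The resulting KL term is the virtual adversarial loss that the training objective penalizes, smoothing the model around each (labeled or unlabeled) input.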
Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias
A common strategy for sparse linear regression is to introduce
regularization, which eliminates irrelevant features by letting the
corresponding weights be zeros. However, regularization often shrinks the
estimator for relevant features, which leads to incorrect feature selection.
Motivated by the above-mentioned issue, we propose Bayesian masking (BM), a
sparse estimation method which imposes no regularization on the weights. The
key concept of BM is to introduce binary latent variables that randomly mask
features. Estimating the masking rates determines the relevance of the features
automatically. We derive a variational Bayesian inference algorithm that
maximizes the lower bound of the factorized information criterion (FIC), which
is a recently developed asymptotic criterion for evaluating the marginal
log-likelihood. In addition, we propose reparametrization to accelerate the
convergence of the derived algorithm. Finally, we show that BM outperforms
Lasso and automatic relevance determination (ARD) in terms of the
sparsity-shrinkage trade-off.
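A heavily simplified caricature of the masking idea (not the paper's variational FIC algorithm): fit unshrunk weights by least squares, and keep a feature's mask only when masking it out noticeably increases the residual error. All thresholds and update rules here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only the first two of five features are relevant.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

pi = np.full(d, 0.5)                           # masking (relevance) rates
for _ in range(10):
    # Least-squares fit of the weights on the expected masked design;
    # note the weights themselves are not penalized (no shrinkage).
    w, *_ = np.linalg.lstsq(X * pi, y, rcond=None)
    base = np.mean((y - (X * pi) @ w) ** 2)
    new_pi = pi.copy()
    for j in range(d):
        pi_j = pi.copy()
        pi_j[j] = 0.0
        increase = np.mean((y - (X * pi_j) @ w) ** 2) - base
        new_pi[j] = 1.0 if increase > 1e-3 else 0.0
    pi = new_pi

print(pi.astype(int).tolist())  # relevant features keep a mask of 1
```

The key property the sketch preserves is that selection happens through the binary masks, so the surviving weights are plain least-squares estimates with no shrinkage bias.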
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood
Factorized information criterion (FIC) is a recently developed approximation
technique for the marginal log-likelihood, which provides an automatic model
selection framework for a few latent variable models (LVMs) with tractable
inference algorithms. This paper reconsiders FIC and fills theoretical gaps of
previous FIC studies. First, we reveal the core idea of FIC that allows
generalization for a broader class of LVMs, including continuous LVMs, in
contrast to previous FICs, which are applicable only to binary LVMs. Second, we
investigate the model selection mechanism of the generalized FIC. Our analysis
provides a formal justification of FIC as a model selection criterion for LVMs
and also a systematic procedure for pruning redundant latent variables that
have been removed heuristically in previous studies. Third, we provide an
interpretation of FIC as a variational free energy and uncover a few
previously unknown relationships between them. A demonstrative study on
Bayesian principal component analysis is provided, and numerical experiments
support our theoretical results.
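The variational free energy interpretation mentioned above can be anchored by the standard decomposition of the marginal log-likelihood (a textbook identity, not FIC-specific; here q(Z) is any distribution over the latent variables):

```latex
\log p(X) \;=\; \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{p(X, Z)}{q(Z)}\right]}_{\text{lower bound (negative variational free energy)}} \;+\; \mathrm{KL}\!\left(q(Z) \,\middle\|\, p(Z \mid X)\right)
```

Since the KL term is nonnegative, maximizing the bracketed lower bound over q approximates the marginal log-likelihood, the quantity that FIC is designed to approximate asymptotically.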
Semi-supervised learning of hierarchical representations of molecules using neural message passing
With the rapid increase of compound databases available in medicinal and
material science, there is a growing need for learning representations of
molecules in a semi-supervised manner. In this paper, we propose an
unsupervised hierarchical feature extraction algorithm for molecules (or, more
generally, graph-structured objects with a fixed number of node and edge
types), which is applicable to both unsupervised and semi-supervised tasks.
Our method extends the recently proposed Paragraph Vector algorithm and
incorporates neural message passing to obtain hierarchical representations of
subgraphs. We
applied our method to an unsupervised task and demonstrated that it
outperforms existing methods on several benchmark datasets. We also showed
experimentally that semi-supervised training enhances predictive performance
compared with supervised training on labeled molecules only.
Comment: 8 pages, 2 figures. Appeared as a poster presentation at the workshop
on Machine Learning for Molecules and Materials at NIPS 2017
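A generic single message-passing step, the building block the abstract refers to, can be sketched as follows. This is a plain neighbor-sum update with hypothetical random weights, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3-node toy "molecule" graph: node 0 bonded to nodes 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
h = rng.normal(size=(3, 4))                # initial node features
W_self = rng.normal(size=(4, 4)) * 0.1     # shared self-transform
W_msg = rng.normal(size=(4, 4)) * 0.1      # shared message transform

def mp_step(h):
    # Each node sums transformed neighbor features, adds its own
    # transformed state, and applies a ReLU nonlinearity.
    messages = adj @ h @ W_msg
    return np.maximum(0.0, h @ W_self + messages)

h1 = mp_step(h)                            # one round of message passing
graph_repr = h1.mean(axis=0)               # readout: mean over nodes
print(h1.shape, graph_repr.shape)
```

Stacking several such steps yields representations of progressively larger subgraphs, which is the hierarchy the method's Paragraph-Vector-style objective is trained on.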
Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks in Molecular Graph Analysis
Graph Neural Network (GNN) is a popular architecture for the analysis of
chemical molecules, and it has numerous applications in material and medicinal
science. Current lines of GNNs developed for molecular analysis, however, do
not fit the training set well, and their performance does not scale with the
complexity of the network. In this paper, we propose an auxiliary module to be
attached to a GNN that can boost the representation power of the model without
interfering with the original GNN architecture. Our auxiliary module can be
attached to a wide variety of GNNs, including those commonly used in
biochemical applications. With our auxiliary architecture, the performance of
many GNNs used in practice improves more consistently, achieving
state-of-the-art performance on popular molecular graph datasets.
Comment: Augmented experiments, title slightly modified
Neural Sequence Model Training via α-divergence Minimization
We propose a new neural sequence model training method in which the objective
function is defined by α-divergence. We demonstrate that the objective
function generalizes the maximum-likelihood (ML)-based and reinforcement
learning (RL)-based objective functions as special cases (i.e., ML corresponds
to α → 0 and RL to α → 1). We also show that the gradient of
the objective function can be considered a mixture of ML- and RL-based
objective gradients. The experimental results of a machine translation task
show that minimizing the objective function with α > 0 outperforms
α → 0, which corresponds to ML-based methods.
Comment: 2017 ICML Workshop on Learning to Generate Natural Language (LGNL
2017)
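The "mixture of gradients" claim can be illustrated on a single-token toy problem. This is my own illustrative interpolation under stated assumptions (token 0 is both the reference token and the rewarded one, and a convex combination weighted by α stands in for the paper's exact gradient):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

ref = 0                                    # reference (ground-truth) token
reward = np.array([1.0, 0.0, 0.0])         # reward also favors token 0

def mixed_grad(theta, alpha):
    p = softmax(theta)
    # ML-style gradient: raise the log-probability of the reference token.
    g_ml = np.eye(3)[ref] - p
    # RL-style gradient: REINFORCE, with the reward as the weight.
    g_rl = sum(reward[k] * p[k] * (np.eye(3)[k] - p) for k in range(3))
    # Hypothetical convex mixture standing in for the alpha-divergence grad.
    return (1 - alpha) * g_ml + alpha * g_rl

probs = {}
for alpha in (0.0, 0.5, 1.0):
    theta = np.zeros(3)
    for _ in range(200):
        theta += 0.5 * mixed_grad(theta, alpha)
    probs[alpha] = float(softmax(theta)[ref])

# Every setting pushes probability mass onto the rewarded reference token.
print(all(p > 0.5 for p in probs.values()))
```

Because the reference token and the reward agree here, all α values converge to the same behavior; the interesting regimes in the paper arise when the reward and the reference sequence disagree.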
Robustness to Adversarial Perturbations in Learning from Incomplete Data
What is the role of unlabeled data in an inference problem, when the presumed
underlying distribution is adversarially perturbed? To provide a concrete
answer to this question, this paper unifies two major learning frameworks:
Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). We
develop a generalization theory for our framework based on a number of novel
complexity measures, such as an adversarial extension of Rademacher complexity
and its semi-supervised analogue. Moreover, our analysis is able to quantify
the role of unlabeled data in the generalization under a more general condition
compared to the existing theoretical works in SSL. Based on our framework, we
also present a hybrid of DRL and EM algorithms that has a guaranteed
convergence rate. When implemented with deep neural networks, our method shows
a comparable performance to those of the state-of-the-art on a number of
real-world benchmark datasets.
Comment: 41 pages, 9 figures
Neural Multi-scale Image Compression
This study presents a new lossy image compression method that utilizes the
multi-scale features of natural images. Our model consists of two networks:
multi-scale lossy autoencoder and parallel multi-scale lossless coder. The
multi-scale lossy autoencoder extracts the multi-scale image features to
quantized variables and the parallel multi-scale lossless coder enables rapid
and accurate lossless coding of the quantized variables via encoding/decoding
the variables in parallel. Our proposed model achieves performance comparable
to the state-of-the-art model on Kodak and RAISE-1k dataset images, and it
encodes a PNG image in 70 ms with a single GPU and a single CPU process and
decodes it into a high-fidelity image in approximately 200 ms.
Comment: 15 pages, 15 figures
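The lossy half of the pipeline, encode, quantize the code, decode, can be sketched with hypothetical linear maps in place of the paper's multi-scale networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signal and a hypothetical linear "autoencoder": a random 64 -> 16
# encoder and its pseudo-inverse as the decoder.
x = rng.normal(size=64)
enc = rng.normal(size=(16, 64)) / 8.0
dec = np.linalg.pinv(enc)

code = enc @ x
q = np.round(code * 8) / 8                 # uniform quantization, step 1/8
x_hat = dec @ q                            # decode the quantized variables

mse = float(np.mean((x - x_hat) ** 2))
print(mse > 0.0, mse < float(np.mean(x ** 2)))
```

The quantized code `q` is what the parallel lossless coder would then compress; since the decoder only ever sees quantized values, entropy coding them loses nothing further.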
DQN-TAMER: Human-in-the-Loop Reinforcement Learning with Intractable Feedback
Exploration has been one of the greatest challenges in reinforcement learning
(RL), which is a large obstacle in the application of RL to robotics. Even with
state-of-the-art RL algorithms, building a well-learned agent often requires
too many trials, mainly due to the difficulty of matching its actions with
rewards in the distant future. A remedy for this is to train an agent with
real-time feedback from a human observer who immediately gives rewards for some
actions. This study tackles a series of challenges in introducing such a
human-in-the-loop RL scheme. The first contribution of this work is our
experiments with a precisely modeled human observer whose feedback is binary,
delayed, stochastic, unsustainable, and given as natural reactions. We also
propose an RL
method called DQN-TAMER, which efficiently uses both human feedback and distant
rewards. We find that DQN-TAMER agents outperform their baselines in Maze and
Taxi simulated environments. Furthermore, we demonstrate a real-world
human-in-the-loop RL application where a camera automatically recognizes a
user's facial expressions as feedback to the agent while the agent explores a
maze.
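A tabular caricature of the DQN-TAMER idea (hypothetical environment, feedback model, and hyperparameters; the actual method uses deep Q-networks): the agent keeps one value function for distant environment rewards and another for immediate human feedback, and acts on their mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # long-term environment-reward estimate
H = np.zeros((n_states, n_actions))   # immediate human-feedback estimate
alpha, gamma, beta = 0.1, 0.9, 1.0    # beta: weight on human feedback

def human_feedback(a):
    # Simulated observer: approves action 1, and sometimes stays silent
    # (modeling unsustainable / missing feedback).
    if rng.random() < 0.3:
        return None
    return 1.0 if a == 1 else -1.0

def step(s, a):
    # Toy chain: action 1 moves right; reward only near the right end.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(200):
    s = 0
    for _ in range(20):
        # Epsilon-greedy over the mixed action values Q + beta * H.
        a = (int(np.argmax(Q[s] + beta * H[s]))
             if rng.random() > 0.1 else int(rng.integers(n_actions)))
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        f = human_feedback(a)
        if f is not None:
            H[s, a] += alpha * (f - H[s, a])
        s = s2

policy = [int(np.argmax(Q[s] + beta * H[s])) for s in range(n_states)]
print(policy)  # the learned policy moves right toward the goal everywhere
```

The human-feedback table H shapes behavior immediately, long before the delayed environment rewards propagate through Q, which is the sample-efficiency benefit the abstract describes.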